Gist

The site https://www.mathgenealogy.org/ contains over 276,000 records of mathematics PhD graduates and their supervisors. This is effectively a genealogy of mathematical supervision (which should have a sizable effect on thinking, topics, and reading). The R package ggenealogy contains an example dataset from this source and facilitates consuming and plotting this type of data.

Given that my thesis was just certified, I want to see if I can trace up the mathematical genealogy tree to visualize my thought-leading predecessors.

Setup

library(ggenealogy)
library(ggplot2)
library(magrittr)

data("statGeneal", package = "ggenealogy")
df <- statGeneal %>%
  #dplyr::filter(parent != "") %>%
  tibble::as_tibble()
print(df, n=3)
## # A tibble: 8,165 x 6
##   child            parent             gradYear country     
##   <chr>            <chr>                 <dbl> <chr>       
## 1 Nicolas Chopin   "Christian Robert"     2003 France      
## 2 Melvin Springer  "Everett Welker"       1947 UnitedStates
## 3 Shelemyahu Zacks ""                     1962 UnitedStates
##   school                                     
##   <chr>                                      
## 1 Université Pierre-et-Marie-Curie - Paris VI
## 2 University of Illinois at Urbana-Champaign 
## 3 Columbia University                        
##   thesis                                                                        
##   <chr>                                                                         
## 1 Applications of Sequential Monte Carlo methods to Bayesian Statistics         
## 2 Joint Sampling Distribution of Mean and Standard Deviation for a Chi-square U~
## 3 Optimal Strategies in Randomized Factorial Experiments                        
## # ... with 8,162 more rows
hist(df$gradYear)

OK, about 8k observations, covering “all the parent-child relationships where both parent and child received an advanced degree of statistics as of June 6, 2015.” This may or may not contain the people I am looking for.

Note that grad year:

  • Is .
  • The median is 5 years greater than the mean, which suggests a left skew
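
A quick way to check the skew claim is to put the median and mean side by side. The sketch below uses a made-up vector of years rather than the real column; `describe_skew` and `toy_years` are hypothetical names, and on the real data the input would be `df$gradYear`.

```r
# Summarise a numeric vector and report how far the median sits from the mean.
# A positive median_minus_mean is a rough hint of left skew, not a formal test.
describe_skew <- function(x) {
  x <- x[!is.na(x)]
  c(
    min = min(x),
    median = median(x),
    mean = mean(x),
    max = max(x),
    median_minus_mean = median(x) - mean(x)
  )
}

# Toy data only: a few early years drag the mean below the median.
toy_years <- c(1950, 1990, 2000, 2005, 2010)
s <- describe_skew(toy_years)
print(s)
```

Here the median (2000) is above the mean (1991), the same pattern noted for the graduation years.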

Where in the world is …?

Through trial and error I know that Di Cook is not in the data. The original paper does feature Thomas Lumley, another professor of interest. But perhaps first I will manually look up Cook’s genealogy.

Di, Di’s supervisor, and “grand-supervisor” are not in the list, so I may have to go to plan B: looking at Thomas Lumley. After looking at both parents and children, I know that Thomas has 1 child in the data: Petra Buzkova. From the paper, we can see that the oldest predecessor is David Cox.

lumley_p <- grepl("Lumley", df$parent, fixed = TRUE)
sum(lumley_p)
## [1] 1
df[lumley_p, ]
## # A tibble: 1 x 6
##   child         parent        gradYear country      school                  
##   <chr>         <chr>            <dbl> <chr>        <chr>                   
## 1 Petra Buzkova Thomas Lumley     2004 UnitedStates University of Washington
##   thesis                                                                        
##   <chr>                                                                         
## 1 Marginal Regression Analysis of Longitudinal Data with Irregular, Biased Samp~
## Prep the network info, more on this in `As network layouts (iGraph)`.
ig <- dfToIG(df)
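
The lookup above only checks the parent column, while the trial and error also involved the child column. A minimal sketch of doing both at once, assuming a table with `child` and `parent` columns as in `df`; the helper `find_person` and the toy names are hypothetical.

```r
# Return the rows where a name appears as a child and as a parent.
# fixed = TRUE matches the pattern literally, as in the grepl() call above.
find_person <- function(pattern, geneal) {
  as_child  <- geneal[grepl(pattern, geneal$child,  fixed = TRUE), , drop = FALSE]
  as_parent <- geneal[grepl(pattern, geneal$parent, fixed = TRUE), , drop = FALSE]
  list(as_child = as_child, as_parent = as_parent)
}

# Toy data only; a blank parent stands for an unknown supervisor.
toy <- data.frame(
  child  = c("Student One", "Student Two", "Advisor A"),
  parent = c("Advisor A",   "Advisor B",   ""),
  stringsAsFactors = FALSE
)

res <- find_person("Advisor A", toy)
nrow(res$as_child)   # rows where the person is the student
nrow(res$as_parent)  # rows where the person is the supervisor
```

On the real data this would be called as `find_person("Lumley", df)`; two empty results mean the person is absent in either role.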

Finding a path

Let’s grab the paths while we are on the topic of names. Actually, if we go all the way to Buzkova, this is the example case in the paper.

pathCB <- getPath("David Cox", "Petra Buzkova", ig, df,
                  "gradYear", isDirected = FALSE)
plotPath(pathCB, df, "gradYear", fontFace = 4) +
  xlab("Graduation Year") +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 10)) +
  scale_x_continuous(expand = c(0.1, 0.2))

Good, we have a start. We will want to find a way to traverse the hierarchy to find all of the ancestors without filling in the cousin nodes (or, preferably, filling them in only faintly). As an example poster, see https://www.mathgenealogy.org/posters/raich.pdf
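
One way to get only the direct line is to walk the parent column upward, generation by generation, and never look sideways at siblings or cousins. The sketch below does this in base R on a made-up table, so the logic is visible; `get_ancestors`, `max_gen`, and the single-letter names are hypothetical, and ggenealogy has helpers of its own that may be the better choice on the real data.

```r
# Toy data only: D -> C -> B -> A, with E as a second student of B.
toy <- data.frame(
  child  = c("D", "C", "B", "A", "E"),
  parent = c("C", "B", "A", "",  "B"),
  stringsAsFactors = FALSE
)

# Collect every ancestor of `name`, recording how many generations up it sits.
# Blank parents are treated as unknown, and `seen` guards against cycles.
get_ancestors <- function(name, geneal, max_gen = Inf) {
  out <- data.frame(name = character(0), gen = integer(0),
                    stringsAsFactors = FALSE)
  current <- name
  seen <- name
  gen <- 0L
  while (length(current) > 0 && gen < max_gen) {
    gen <- gen + 1L
    parents <- unique(geneal$parent[geneal$child %in% current])
    parents <- setdiff(parents[!is.na(parents) & parents != ""], seen)
    if (length(parents) == 0) break
    out <- rbind(out, data.frame(name = parents, gen = gen,
                                 stringsAsFactors = FALSE))
    seen <- c(seen, parents)
    current <- parents
  }
  out
}

anc <- get_ancestors("D", toy)
print(anc)
```

E shares a supervisor with C but never appears in the result, which is the "no cousins" behaviour wanted here. The `gen` column could later drive the faint-versus-bold styling.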

Making trees

l <- plotAncDes("David Cox", df, mAnc = 1, mDes = 6, vCol = "blue") +
  labs(subtitle = "Interesting, but too many \n  cousins of Thomas Lumley")
r <- plotAncDes("Thomas Lumley", df, mAnc = 6, mDes = 1, vCol = "blue") +
  labs(subtitle = "Not very interesting, \n  nb only 1:1 relationships")

library(patchwork)
l + r

plotPathOnAll(pathCB, df, ig, "gradYear",
              bin = 200, nodeSize = 1, pathNodeSize = 2.5,
              nodeCol = "darkgray", edgeCol = "lightgray", 
              animate = TRUE) ## plotly static interaction not animated.

As network layouts (iGraph)

ig <- dfToIG(df)
class(ig)
## [1] "igraph"
ig
## IGRAPH 8e8f3b0 UNW- 7123 8165 -- 
## + attr: name (v/c), weight (e/n)
## + edges from 8e8f3b0 (vertex names):
##  [1] Nicolas Chopin   --Christian Robert   Melvin Springer  --Everett Welker    
##  [3] Shelemyahu Zacks --                   James Sweeder    --                  
##  [5] Nino Kordzakhia  --                   Pavel Vanecek    --Zuzana Prášková   
##  [7] Shyamal De       --                   Thomas Willke    --                  
##  [9] Vasant Huzurbazar--                   Rita Engelhardt  --William Cumberland
## [11] Fred Andrews     --                   Arthur Albert    --                  
## [13] John Folks       --                   Arnold Goodman   --                  
## [15] William Pruitt   --                   Thomas Birkner   --                  
## + ... omitted several edges
getBasicStatistics(ig)
## $isConnected
## [1] TRUE
## 
## $numComponents
## [1] 1
## 
## $avePathLength
## [1] 2.801
## 
## $graphDiameter
## [1] 10
## 
## $numNodes
## [1] 7123
## 
## $numEdges
## [1] 8165
## 
## $logN
## [1] 8.871
plot(ig)
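
The edge list printed above shows rows such as `Shelemyahu Zacks --` with nothing on the right, which suggests that a blank parent is being treated as a vertex of its own. If so, every person with an unknown supervisor is joined through that one placeholder, and that could be part of why the graph reports a single component. A minimal sketch of checking for this, using a made-up table and base R only (the names, `known`, and the counts are hypothetical); it mirrors the commented-out filter in Setup.

```r
# Toy data only: X and Y have unknown supervisors, stored as blank strings.
toy <- data.frame(
  child  = c("B", "C", "X", "Y"),
  parent = c("A", "B", "",  ""),
  stringsAsFactors = FALSE
)

# How many rows would become an edge to the blank placeholder?
n_unknown <- sum(toy$parent == "")

# Keep only rows with a named supervisor.
known <- toy[toy$parent != "", , drop = FALSE]

# Compare the vertex sets implied by each version of the table.
nodes_all   <- unique(c(toy$child, toy$parent))
nodes_known <- unique(c(known$child, known$parent))

n_unknown
"" %in% nodes_all      # the blank string counts as a node before filtering
length(nodes_all)
length(nodes_known)
```

On the real data, the same subset of `df` could be passed to `dfToIG()` to compare the output of `getBasicStatistics()` before and after.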

Session info

## Packages used
pkgs <- c("ggenealogy", "ggplot2")
## Package & session info
devtools::session_info(pkgs)
## - Session info ---------------------------------------------------------------
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Windows 10 x64 (build 19044)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_United States.1252
##  ctype    English_United States.1252
##  tz       Australia/Sydney
##  date     2022-06-09
##  pandoc   2.11.4 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
## 
## - Packages -------------------------------------------------------------------
##  package      * version date (UTC) lib source
##  askpass        1.1     2019-01-13 [1] CRAN (R 4.1.2)
##  base64enc      0.1-3   2015-07-28 [1] CRAN (R 4.1.1)
##  cli            3.3.0   2022-04-25 [1] CRAN (R 4.1.3)
##  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.1.2)
##  cpp11          0.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  crayon         1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
##  crosstalk      1.2.0   2021-11-04 [1] CRAN (R 4.1.2)
##  curl           4.3.2   2021-06-23 [1] CRAN (R 4.1.2)
##  data.table     1.14.2  2021-09-27 [1] CRAN (R 4.1.2)
##  digest         0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
##  dplyr          1.0.9   2022-04-28 [1] CRAN (R 4.1.3)
##  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.0.5)
##  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
##  farver         2.1.0   2021-02-28 [1] CRAN (R 4.1.2)
##  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
##  generics       0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
##  ggenealogy   * 1.0.1   2020-03-04 [1] CRAN (R 4.1.3)
##  ggplot2      * 3.3.6   2022-05-03 [1] CRAN (R 4.1.3)
##  glue           1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
##  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.1.1)
##  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
##  htmlwidgets    1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httr           1.4.3   2022-05-04 [1] CRAN (R 4.1.3)
##  igraph         1.3.1   2022-04-20 [1] CRAN (R 4.1.3)
##  isoband        0.2.5   2021-07-13 [1] CRAN (R 4.1.2)
##  jsonlite       1.8.0   2022-02-22 [1] CRAN (R 4.1.3)
##  labeling       0.4.2   2020-10-20 [1] CRAN (R 4.1.1)
##  later          1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lattice        0.20-45 2021-09-22 [1] CRAN (R 4.1.3)
##  lazyeval       0.2.2   2019-03-15 [1] CRAN (R 4.1.2)
##  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
##  magrittr     * 2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
##  MASS           7.3-57  2022-04-22 [1] CRAN (R 4.1.3)
##  Matrix         1.4-1   2022-03-23 [1] CRAN (R 4.1.3)
##  mgcv           1.8-40  2022-03-29 [1] CRAN (R 4.1.3)
##  mime           0.12    2021-09-28 [1] CRAN (R 4.1.1)
##  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.1.1)
##  nlme           3.1-157 2022-03-25 [1] CRAN (R 4.1.3)
##  openssl        2.0.2   2022-05-24 [1] CRAN (R 4.1.3)
##  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
##  plotly         4.10.0  2021-10-09 [1] CRAN (R 4.1.2)
##  plyr           1.8.7   2022-03-24 [1] CRAN (R 4.1.3)
##  promises       1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.0.3)
##  R6             2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
##  RColorBrewer   1.1-3   2022-04-03 [1] CRAN (R 4.1.3)
##  Rcpp           1.0.8.3 2022-03-17 [1] CRAN (R 4.1.3)
##  reshape2       1.4.4   2020-04-09 [1] CRAN (R 4.1.2)
##  rlang          1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
##  scales         1.2.0   2022-04-13 [1] CRAN (R 4.1.3)
##  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
##  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.1.2)
##  sys            3.4     2020-07-23 [1] CRAN (R 4.1.2)
##  tibble         3.1.7   2022-05-03 [1] CRAN (R 4.1.3)
##  tidyr          1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
##  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
##  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
##  vctrs          0.4.1   2022-04-13 [1] CRAN (R 4.1.3)
##  viridisLite    0.4.0   2021-04-13 [1] CRAN (R 4.1.2)
##  withr          2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
##  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
## 
##  [1] C:/Users/spyri/Documents/R/win-library/4.1
##  [2] C:/Program Files/R/R-4.1.2/library
## 
## ------------------------------------------------------------------------------